Chapter 4
IN THIS CHAPTER
Examining the evolution of statistical software
Surveying commercial, open source, and free options
Considering code-based versus non–code-based software
Storing data in the cloud
Before statistical software existed, complex regressions that were possible in theory were too laborious to carry out by hand on real datasets. It wasn’t until the development of the SAS suite of statistical software, which began in the 1960s, that analysts were able to do these calculations. As technology advanced, other types of software were developed, including open-source software and web-based software.
As you may imagine, all these choices have led to competition and confusion among the analysts, students, and organizations that use this software. Organizations wonder which statistical packages to implement. Professors wonder which ones to teach, and students wonder which ones to learn. The purpose of this chapter is to help you make informed choices about statistical software. We describe the practical choices you have today and provide guidance on choosing among commercial, open-source, and free software.
We also provide guidance on how to choose between code-based and non–code-based software, and end by providing advice on cloud data storage.
The first widely used commercial statistical software was SAS, and it is still used today. SAS was originally developed in the 1960s and 1970s to run on mainframe computers. Around 2000, SAS was adapted to personal computers (a version known as PC SAS), adding a user-friendly graphical user interface (GUI). During the growth of SAS, other commercial statistical packages appeared, the most popular being IBM’s SPSS. SAS continues to be the go-to program for big data analysis, where analysts can easily access large datasets from servers. In contrast, SPSS continues to be used on a personal computer, like PC SAS.
If you had taken a college statistics course in the year 2000, it would likely have taught either SAS or SPSS. Professors would have made either SPSS or SAS available to you for free or for a nominal license fee from your college bookstore. If you take a college statistics course today, you may be in the same situation, or you may find yourself learning open-source statistical software. The most common options are R and Python. This software is free to the user and downloadable online because it is built by the user community, not a company.
As the Internet evolved, more options became available for statistical software. In addition to the existing stand-alone applications described earlier, specialized statistical apps were developed that only perform one or a small collection of specific statistical functions (such as G*Power and PS, which are for calculating sample sizes). Similarly, web-based online calculators were developed, which are typically programmed to do one particular function (such as calculate a chi-square statistic and p value from counts of data, as described in Chapter 12). Some web pages feature a collection of such calculators.
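To give a sense of what one of these single-purpose calculators does behind the scenes, here is a minimal Python sketch that computes a chi-square statistic and p value from a 2-by-2 table of counts. It is an illustration under stated assumptions, not the code of any particular website: the function name `chi_square_2x2` and the example counts are made up, and the calculation is the plain Pearson chi-square without a continuity correction, which a given calculator may or may not apply.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]] of observed counts. Returns (statistic, p_value)."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the row and column variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square upper-tail probability
    # reduces to erfc(sqrt(statistic / 2)), so no extra library is needed.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts for illustration: 20 vs. 30 in one group, 30 vs. 20 in the other.
stat, p = chi_square_2x2(20, 30, 30, 20)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

For these made-up counts, every expected cell count is 25, so the statistic works out to 4.00 and the p value to about 0.0455. The shortcut used for the p value only holds for 2-by-2 tables (1 degree of freedom); larger tables need a general chi-square distribution function.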
Before 2010, if an organization performed statistical analysis as part of its core function, it needed to purchase commercial statistical software like SAS or SPSS. Advantages of implementing commercial software include the ability to perform many statistical functions, technical support from the software company, and the expectation that the software will remain in use in the future as the company continues to support and upgrade it.
However, organizations today are hesitant to adopt commercial software when they can instead use open-source software like R or Python. Admittedly, even though it is free of charge, open-source software has many downsides. First, because open-source software does not come with tech support from a vendor, you need to hire analysts who know it well enough to figure out what to do when there’s a problem. Next, you need to hire a lot more analysts than you would with commercial software, because much of their work will go into customizing the software for your use and keeping it updated so that your organization runs smoothly.
So, why are new organizations today hesitant to adopt commercial software when open-source software has so many downsides? The main reason is that the old advantages of commercial software are not as true anymore. SAS and SPSS are expensive programs, but they have much of the same functionality as open-source R and Python, which are free. In some cases, analysts prefer the open-source application to the commercial application because they can customize it more easily to their setting. Also, it is not clear that commercial software is innovating ahead of open-source software. Organizations do not want to get entangled with expensive commercial software that eventually starts to perform worse than free open-source alternatives!
In the following sections, we discuss the most popular commercial statistical software available currently.
SAS is the oldest commercial software currently available. It started out with two main components, Base SAS and SAS/STAT, that provided the most commonly used statistical calculations. Today, it has grown to include many additional components and sublanguages. SAS has always been so expensive that only organizations with a significant budget can afford to purchase and use it. However, because individual learners need to be able to practice SAS even if they cannot afford it, SAS developed a free, online version called SAS OnDemand for Academics (ODA) that is available at https://welcome.oda.sas.com.
Originally, SAS ran as command-prompt software without a graphical user interface, or GUI, which came later in the 2000s when PC SAS was invented. In the original SAS, the user would gain access to datasets in SAS format that resided on a SAS server in the same environment. The user would write code files using SAS code and run these files against the SAS data. Running a file would produce a log file that explained how the code was executed and reported any errors. It would also produce output that provided the results of the statistical procedures.
Today, the experience of using SAS has been modernized. In PC SAS and SAS ODA, it is easy to view code, log, and output files in different windows and switch back and forth between them. It is also easier to move data into and out of the SAS environment and to create integrated application pipelines involving the SAS environment. The new commercial cloud-based version of SAS, called Viya, is intended to be used with data stored in the cloud rather than on SAS servers (see the later section “Storing Data in the Cloud” for more).
SAS is entrenched in some industries, such as pharmaceutical, insurance, and banking, because SAS has historically been the only program powerful enough to handle the size of their datasets. Those settings traditionally used SAS servers for data storage. Now, this practice is being challenged because other analytic options may look more appealing than what SAS has to offer (see the section “Focusing on open-source and free software”). In addition, many companies are having trouble maintaining their old-fashioned SAS servers and want to move their data to cloud storage. These industries are looking for SAS users to help them modernize their operations.
SPSS was invented more recently than SAS and runs in a fundamentally different way. SPSS does not expect you to have a data server the way SAS does. Instead, SPSS runs as a stand-alone program like PC SAS, and expects you to import data into it for analysis. Therefore, SAS is more likely to be used in a team environment, while SPSS tends to have individual users.
Like SAS, SPSS produces output, but unlike SAS, SPSS is typically operated through menu selections rather than by writing and running code. SPSS produces one long output file that includes all the output from each SPSS session. In the output file, SPSS includes the code it writes automatically based on your menu selections. Therefore, as with SAS, it is possible to save SPSS code files and output files and rerun the same code later. SPSS is available from IBM’s website at www.ibm.com/products/spss-statistics/pricing.
Microsoft Excel has been used in some domains for statistical calculations, but it is difficult to use with large datasets. Excel has built-in functions for summarizing data (such as calculating the means and standard deviations discussed in Chapter 9). It also has common probability distribution functions such as Student t (Chapter 11) and chi-square (Chapter 12). You can even do straight-line regression (Chapter 16), and more extensive analyses are available through add-ins.
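For comparison, the same kinds of summaries can be produced in code. The following Python sketch computes a mean, a sample standard deviation, and a straight-line (least-squares) fit, roughly what Excel's AVERAGE, STDEV.S, SLOPE, and INTERCEPT functions return. The helper name `straight_line_fit` and the dose and response values are made up for illustration; this is one simple way to do it, not a recommended workflow.

```python
import statistics


def straight_line_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    # Sum of squared deviations of x, and sum of cross-products of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data chosen to lie exactly on the line y = 1 + 2x.
dose = [1, 2, 3, 4]
response = [3, 5, 7, 9]

print("mean response:", statistics.mean(response))
print("SD of response:", round(statistics.stdev(response), 3))

slope, intercept = straight_line_fit(dose, response)
print("slope:", slope, "intercept:", intercept)
```

Because the made-up points fall exactly on a line, the fit recovers a slope of 2 and an intercept of 1. Real data will not be this tidy.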
Microsoft Excel is available in different formats, including both downloadable and web based. Purchase it from Microsoft at www.microsoft.com.
A more modern approach to statistical software is to create an online platform known as an analytics suite that allows you to connect to data sources and conduct analytics online. Here are a few popular online platforms:
Tableau: www.tableau.com
GraphPad: www.graphpad.com
Open-source software refers to software that has been developed and supported by a user community. Although open-source software has licenses, they are typically free but require you to adhere to certain policies when using the software. In this section, we talk about the two most popular open-source statistical software packages: R and Python.
The two most popular and extensive open-source statistical programs are R and Python.
R: https://cran.r-project.org
Python: www.python.org/downloads
Other statistical software packages are free, but they are not technically open-source, meaning they were not developed by an open-source community, and they are not licensed the same way.
This section provides examples of free software that performs many functions like SAS and R.
OpenStat: https://openstat.info
Epi Info: https://www.cdc.gov/epiinfo/index.html
Biostatisticians frequently encounter the problem of estimating sample size. The following are two free applications we recommend for performing sample-size calculations:
G*Power: www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower. To use the program, you download it from this website and install it on your computer.
PS: https://biostat.app.vumc.org/wiki/Main/PowerSampleSize
Most of the software mentioned up to this point in this chapter (including SAS, SPSS, R, and Python) uses code files that can be saved and rerun on data at a later date. These programs run fundamentally differently from programs such as Microsoft Excel, where you can run statistics on data, but no code files are produced and saved. Likewise, when you use web-based calculators, specialized apps like G*Power and PS for sample-size calculations, or online commercial platforms, no code files are produced and saved.
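To make the contrast concrete, here is a sketch of what a small, rerunnable code file might look like in Python. It estimates the number of participants needed per group to compare two means, using the common normal-approximation formula n = 2 × ((z for alpha + z for power) × SD / difference)². This is an assumption-laden illustration: the function name `n_per_group` and the input values are hypothetical, and dedicated programs such as G*Power and PS may use more exact methods and give slightly different answers.

```python
import math
from statistics import NormalDist


def n_per_group(difference, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two
    means, using the normal approximation (not an exact t-based method)."""
    if difference == 0:
        raise ValueError("the difference to detect must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = NormalDist().inv_cdf(power)          # value corresponding to desired power
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n)  # round up to a whole participant


# Hypothetical planning values: detect a 5-unit difference when the SD is 10.
print(n_per_group(difference=5, sd=10))              # 80% power
print(n_per_group(difference=5, sd=10, power=0.90))  # 90% power
```

Because the calculation lives in a saved file, you can rerun it months later with different planning values, or hand it to a colleague who can see exactly which assumptions produced the answer. With a menu-driven app or web calculator, you would have to re-enter the settings and document them separately.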
Cloud-based storage refers to storing large data files on a set of Internet servers designed specifically for large data storage. Unlike old-fashioned stand-alone servers in server rooms, cloud-based servers share storage space across the Internet, providing on-demand access and backup capabilities. If you want to retire an old-fashioned server in your server room (which could be a SAS server), you will have to contract with a cloud-based storage company to use its space for your data. Then, you will have to find a way to move your data from your server into your new cloud storage. You will also want to be sure you are comfortable with a long-term relationship with this company, so you don’t have to move your data again anytime soon.